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Abstract — We  present  a  new  method  for  obtaining  local  error  bars,  i.e.,  estimates  of 
the  confidence  in  the  predicted  value  that  depend  on  the  input.  We  approach  this  prob¬ 
lem  of  nonlinear  regression  in  a  maximum  likelihood  framework.  We  demonstrate  our 
technique  first  on  computer  generated  data  with  locally  varying,  normally  distributed 
target  noise.  We  then  apply  it  to  the  laser  data  from  the  Santa  Fe  Time  Series  Compe¬ 
tition.  Finally,  we  extend  the  technique  to  estimate  error  bars  for  iterated  predictions, 
and  apply  it  to  the  exact  competition  task  where  it  gives  the  best  performance  to  date. 


1  Obtaining  Error  Bars  Using  a  Maximum  Likelihood  Framework 

1.1  Motivation  and  Concept 

Feed-forward  artificial  neural  networks  are  widely  used  and  well-suited  for  nonlinear  regression.  They 
can  be  interpreted  as  predicting  the  expected  value  of  the  conditional  target  distribution  as  a  function 
of  (or  “conditioned  on”)  the  input  pattern  (e.g.,  Buntine  &  Weigend,  1991).  This  target  distribution  in 
response  to  each  input  may  also  be  viewed  as  an  error  model  (Rumelhart  et  a/.,  1994).  Often  an  estimate 
of  the  mean  of  this  conditional  target  distribution  suffices;  this  is  typically  done  using  one  output  unit. 
However,  we  here  present  a  method  that  gives  more  information  than  just  the  mean  of  that  distribution. 

Such  additional  information  could  be  obtained  by  attempting  to  estimate  the  entire  conditional  target 
distribution  with  connectionist  methods  (e.g.,  “fractional  binning,”  Srivastava  &  Weigend,  1994)  or  with 
non-connectionist  methods  such  as  a  Monte  Carlo  on  a  Hidden  Markov  Model  (Fraser  &  Dimitriadis, 
1994).  Nonparametric  estimates  of  the  shape  of  a  conditional  target  distribution  require  large  quantities 
of  data.  In  contrast,  our  less  data-hungry  method  assumes  a  specific  parameterized  form  of  the  conditional 
target  distribution  (e.g.,  Gaussian)  and  gives  us  the  value  of  the  error  bar  (e.g.,  the  width  of  the  Gaussian) 
by  finding  those  parameters  that  maximize  the  likelihood  of  the  data  given  the  model. 

Specifically,  when  we  use  a  network  output  y  to  approximate  a  function  /(x),  we  assume  that  the  desired 
output  ( d )  (i.e.,  the  observed  values)  can  be  modeled  by  d(x)  =  /(x)  •+  n(x)  where  n(x)  is  noise  drawn 
from  the  assumed  error  model  distribution.  Just  as  the  estimate  of  the  mean  of  this  distribution  y(x) 
is  a  function  of  the  input,  the  variance  a2  may  also  vary  as  a  function  of  the  location  in  input  space  x. 
When  the  noise  level  varies  over  the  input  space  (i.e.,  cr2(x)  depends  on  x;  it  is  not  a  constant),  not  only 
do  we  want  the  network  to  learn  an  output  function  y(x)  that  estimates  the  expectation  value  /i(x)  of 
the  conditional  target  distribution,  but  we  also  want  to  learn  a  function  ?;(x)  that  estimates  the  variance 
<r2(x)  of  that  distribution. 

Therefore,  we  simply  add  an  auxiliary  output  unit,  the  v-unit,  that  computes  i;(x),  our  estimate  of  <r2(x). 
Since  <r2(x)  must  be  positive,  we  choose  an  exponential  activation  function  for  i;(x)  to  naturally  impose 
this  bound:  t>(x)  =  exp  QT  wvkhk(x)  +  /?],  where  f3  is  the  offset  (or  “bias”),  and  wvk  is  the  weight 
between  hidden  unit  k  and  the  f-unit.  This  architecture  is  sketched  in  Figure  1. 


!  i - r 


Figure  1:  Architecture  of  the  network  with  two  output  units.  The  y-unit  predicts  the  conditional  mean 
(i.e.,  given  the  input)  of  the  output  distribution.  The  i;-unit  predicts  the  conditional  variance  of  that 
distribution.  All  weight  layers  have  full  connectivity.  This  architecture  allows  the  t?-unit  to  access  both 
information  in  the  input  pattern  itself  and  in  the  hidden  unit  representation  formed  while  learning  y(x). 
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Our  network  has  two  output  units.  For  one  of  them,  the  y-unit,  the  target  is  easily  available — it  is 
simply  given  by  d.  But  what  is  the  target  for  the  u-unit?  We  have  to  invent  a  target:  this  is  effectively 
done  by  maximizing  the  likelihood  of  our  network  M  given  the  data.  Using  Bayes’  rule  and  assuming 
statistical  independence  of  the  errors,  we  equivalently  minimize  the  negative  log  likelihood  or  cost  C  = 
—  ln  P {di\-x.i , N) .  Traditionally,  the  resulting  form  of  the  cost  C  involves  only  the  estimate  y(x,)  of 
the  mean  of  the  assumed  error  model  as  a  function  of  the  input:  the  variance  is  assumed  to  be  constant, 
and  constant  terms  drop  out  after  differentiation.  In  contrast,  we  here  allow  the  variance  to  depend  on 
the  input,  and  explicitly  keep  these  terms  in  C .  Given  any  network  architecture  and  any  error  model, 
the  appropriate  weight-update  equations  for  gradient  decent  learning  can  be  derived  straightforwardly. 
In  the  next  section,  we  illustrate  this  for  the  case  where  we  assume  that  the  deviations  of  the  observed 
value  from  the  mean  are  Gaussian  distributed,  and  the  variance  of  the  Gaussian  is  given  by  v(x). 

1.2  Gaussian  Error  Model 


Assuming  normally  distributed  errors  around  /(x)  corresponds  to  a  conditional  target  probability  distri¬ 
bution  of 

n/Jl  ,A  1  /  [di-Mxi)]2'1 

P(di\xi,N)  =  —7====  exp'  — 


2<t2(x,- 


(1) 


A/2?r<72(xi) 

with  y(x)  =  /(x).  Using  the  network  output  y(xj)  «  y(x2-)  to  estimate  the  mean,  and  v(x*)  <t2(x2)  to 

estimate  the  variance,  we  obtain  for  the  monotonically  related  negative  log  likelihood, 

i2 

(2) 


1„P«|  x„  AO  =  +  [*-»<*'■>] 


2  '  2u(xj) 

Dropping  constant  terms,  summation  over  all  patterns  i  yields  the  total  cost: 

i  ( [<U  -  y(*,)]3  ' 


v(x,) 


+  ln[v(xi)] 


(3) 


In  order  to  write  the  explicit  weight-update  equations,  we  have  to  make  assumptions  about  the  transfer 
function  of  the  units  in  the  network.  We  here  choose  linear  activation  function  for  the  y-unit,  tanh 
activation  functions  for  the  hidden  units,  and  an  exponential  activation  function  for  the  v-unik.  We  can 
then  take  derivatives  of  the  cost  C  with  respect  to  the  network  weights.  To  update  weights  connected  to 
the  y  and  v-units  we  have: 


1 

Awyj  =  V-7-Tidi-y{^i)]hj(xi) 

VyXi  J 

Awvk  =  rj  1  { [di  -  j/(x,-)]2  -  w(xi)}  hk(xi) 


(4) 

(5) 


where  rj  is  the  learning  rate.  For  weights  not  connected  to  the  output,  the  weight-update  equations  are 
derived  by  using  the  chain  rule  in  the  same  way  as  in  standard  backpropagation.  Note  that  Eq.  (5) 
is  equivalent  to  training  a  separate  function- approximation  network  where  the  targets  for  v(x)  are  the 
squared  errors.  Note  also  that  if  v(x*)  is  constant,  Eqs.  (3)~(4)  reduce  to  their  familiar  forms  for  standard 
backpropagation  with  a  sum-squared,  error  cost  function. 


A  variation  of  v (x,-)  with  i  implies  a  different  weighting  of  each  pattern.  The  l/u(x)  term  in  Eqs.  (4)-(5) 
can  be  interpreted  as  a  form  of  weighted  regression,  lowering  the  effective  learning  reate  in  high-noise 
regions.  The  effect  is  that  the  network  tries  hard  to  obtain  small  errors  on  those  patterns  where  it  can: 
it  tries  less  hard  on  the  patterns  for  which  the  expected  error  is  going  to  be  large  anyway. 


2  Mechanics 


If  the  weighted  regression  term  is  allowed  a  significant  influence  early  in  gradient  descent,  local  minima 
frequently  result:  the  network  consumes  all  its  resources  by  trying  hard  to  fit  the  first  statistical  feature 
it  happens  to  encounter  low  errors  on,  discounting  other  patterns  as  being  “high-variance.”  To  avoid 
premature  weighting  of  different  patterns  (that  would  be  based  on  inaccurate  values  of  v(xj)  before  /(x) 
is  at  least  roughly  approximated  by  y(x)),  we  separate  training  into  three  phases: 

Phase  I  (Mean):  Randomly  split  the  available  data  into  equal  halves,  sets  A  and  B.  Learn  the 
conditional  expectation  value  y(x)  by  using  set  A  as  training  set.  In  Phase  I  we  use  simple  gradient 
descent  on  a  simple  squared-error  cost  function,  i.e.,  Eqs.  (3)-(4)  without  the  l/t;(x)  terms.1  To  guard 
against  overfitting,  training  is  considered  complete  at  the  minimum  of  the  squared  error  on  the  cross- 
validation  set  B,  monitored  at  the  end  of  each  complete  pass  through  the  data. 

1  Further  details  are:  all  inputs  are  scaled  to  zero  mean  and  unit  variance.  All  initial  weights  feeding  into  hidden  units 
are  drawn  from  a  uniform  distribution  E  [l/«,  —  1/z]  where  i  is  the  number  of  incoming  connections.  All  initial  weights 
feeding  into  y  or  v  are  drawn  from  a  uniform  distribution  E  [s/i,  —s/i]  where  s  is  the  standard  deviation  of  the  (overall) 
target  distribution.  No  momentum  is  used,  and  all  weight  updates  are  averaged  over  the  forward  passes  of  20  patterns. 
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Phase  II  (Variance):  Attach  a  layer  of  hidden  units  connected  to  both  the  inputs  and  the  hidden  units 
of  the  network  from  Phase  I  (see  Figure  1).  Freeze  the  weights  trained  in  Phase  I,  and  train  the  v-unit 
to  predict  the  squared  errors,  again  using  simple  gradient  descent  as  in  Phase  I.  The  training  set  for  this 
phase  is  set  B,  with  set  A  used  for  cross-validation —  if  set  A  were  used  as  the  training  set  in  this  phase 
as  well,  possible  overfitting  in  Phase  I  could  result  in  seriously  underestimating  'u(x).  To  avoid  this  risk, 
we  interchange  the  data  sets.  The  initial  value  for  the  offset  /?  of  the  ^-unit  is  the  natural  logarithm  of 
the  mean  squared  error  (from  Phase  I)  of  set  B.  Phase  II  stops  when  the  squared  error  on  set  A  levels 
out  or  starts  to  increase. 

Phase  III  (Weighted  regression):  Re-split  the  available  data  into  two  new  halves,  A!  and  Bf .  Unfreeze 
all  weights  and  train  all  network  parameters  to  minimize  the  full  cost  function  C  using  Eqs.  (4)-(5)  and 
their  chain-rule  counterparts  on  set  A' .  Training  is  considered  complete  when  C  has  reached  its  minimum 
on  set  Bf .  Note  that  the  l/v(x)  terms  in  Eqs.  (4)-(5)  implement  a  form  of  weighted  regression,  attenuating 
the  learning  in  regions  where  the  estimated  i;(x)  is  high.  The  three-phase  approach  greatly  reduces  the 
probability  of  local  minima  by  giving  the  composite  network  a  head  start  on  learning  the  function  /(x). 


3  Example  #1:  Computer-Generated  Data 


To  demonstrate  the  application  this  method,  we  construct  a  one-dimensional  example  problem  where  the 
true  /(x)  and  <r2(x)  are  known.  We  take  the  equation 

f(x)  =  sin(cja£)  sin  (a >px)  =  -  [cos(ua  —  up)x  —  cos(u>a  +  up)x\  ,  with  ua  =  2.5  and  up  —  1.5  (6) 

over  the  interval  x  G  [0,tt].  We  generate  the  target  values  by  adding  normally  distributed  noise  n(x)  = 
iV(0,  a2{x)))  where  a2(x)  varies  according  to 

cr2(x)  —  0.01  +  0.25  x  [1  —  sin(u;a£)]2  .  (7) 

We  generate  1000  patterns  for  training,  and  an  additional  105  patterns  for  evaluation  after  training. 


Figure  2:  The  computer-generated  example: 

fa)  True  function  f(x)  and  the  1000  training  points; 

(b)  Histogram  of  residual  errors  (end  of  Phase  III); 

(c)  Ti*ue  f(x)  (solid  line)  and  network  output  y(x)  at  end  of  Phase  I  (dotted)  and  Phase  III;  also  shown 
is  an  example  of  a  local  minimum  resulting  from  starting  immediately  with  Phase  III. 

(d)  True  cr2(x)  and  network  output  u(a?)  at  end  of  Phase  II  and  Phase  III.  (Phase  I  value  is  mean  squared 
error.)  Legend  applies  to  both  (c)  and  (d). 


Training  follows  exactly  the  three  phases  described  above  with  the  following  details:  Phase  I  uses  a 
network  with  one  hidden  layer  of  10  tanh  units  and  a  conservatively  small  learning  rate  of  77  =  10“ 2 .  For 
Phase  II  we  add  an  auxiliary  layer  of  10  tanh  hidden  units  connected  to  the  i;-unit  (see  Figure  1)  and  use 
the  same  77.  In  Phase  III  the  composite  network  is  trained  by  gradient  descent  on  Eq.  (3)  with  77  =  10“4, 
minimizing  the  full  cost  (weighted  regression).  In  Figures  2b  and  2c  we  also  plot  a  local  minimum  resulting 
from  attempting  to  minimize  Eq.  (3)  directly:  the  model  misspecification  is  interpreted  as  a  high-noise 
region  and  effectively  ignored  in  learning. 

Note  the  kurtosis  (tall  center  and  heavy  tails)  of  the  distribution  of  the  residuals  (Figure  2b),  typical 
for  data  drawn  from  at  least  two  normal  distributions  with  the  same  mean  but  different  variances.  Note 
in  Figure  2c  how  the  Phase  I  estimate  is  closer  to  the  true  function  overall ,  but  misses  the  function  in 
the  low-noise  region  (for  0  <  x  <  1).  In  contrast,  the  Phase  III  weighted-regression  estimate  is  much 
closer  to  f(x)  for  0  <  x  <  1,  but  is  smoother  for  1.5  <  x  <  2.5  where  the  uncertainty  is  large  anyway. 
Figure  2d  shows  that  v(x)  after  the  final  Phase  III  accurately  estimates  cr2(x),  with  an  appropriate  slight 
overestimation  in  the  region  of  model  misspecification.  Table  1  illustrates  the  importance  of  the  fine 
tuning  that  Phase  III  performs  in  minimizing  the  negative  log  likelihood.  The  entries  in  the  table  show 
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that  the  network  is  able  to  predict  the  expected  error  for  this  noisy  system  very  well:  the  percentage  of 
residual  values  less  than  one  and  two  predicted  standard  deviations  is  very  close  to  the  theoretical  and 
optimally  achievable  values  for  this  data  set. 


row 

Train 

(TV  =  103) 

Test 

{N  =  105) 

NMSE 

cost 

NMSE 

cost 

1 

Phase  I 

0.599 

-0.039 

0.582 

-0.051 

2 

Phase  II 

0.599 

-0.395 

0.582 

-0.398 

3 

Phase  III 

0.604 

-0.498 

0.588 

-0.513 

4 

n(x )  ( exact  additive  noise) 

0.588 

- 0.542 

0.562 

- 0.581 

P 

P 

5 

p(yl/2(x),  residual  errors) 

0.579 

0.597 

6 

p(cr(x),  residual  errors) 

0.588 

0.605 

1  std 

2  std 

1  std 

2  std 

7 

%  of  errors  <  v1  f2  (x)\2v1/2(x) 

70.2 

95.3 

69.63 

95.16 

8 

%  of  errors  <  a(x);2cr(x) 

67.7 

95.2 

66. 76 

94.67 

9 

(exact  Gaussian) 

68.26 

95.44 

68.26 

95.44 

Table  1:  Results  of  the  computer-generated  example.  The  first  3  rows  give  the  normalized  mean  squared 
error  and  the  cost  (more  negative  values  are  better).  (NMSE  denotes  the  mean  squared  error  divided  by 
the  overall  variance  of  the  target).  While  the  error  for  Phase  I  and  Phase  II  is  identical  (frozen  weights 
to  y- unit),  the  cost  decreases  after  Phase  II.  Row  4  compares  to  the  values  of  the  exact  model.  Row  5 
gives  the  correlation  coefficient  between  the  network’s  predictions  for  the  standard  error  (i.e. ,  the  square 
root  of  the  activation  of  the  t;-unit)  and  the  actually  occurring  Ll-errors,  | d(x)  —  y(x) |.  Row  6  gives  the 
value  obtained  by  comparing  the  value  a  (i.e.,  the  ideal  model)  with  the  residual  errors.  Rows  7  and  8 
give  the  percentage  of  residuals  smaller  than  one  and  two  standard  deviations. 

4  Example  ^2:  Laser  Data  Set  From  the  Santa  Fe  Competition 

We  now  apply  our  method  to  a  set  of  observed  data,  the  1000-point  laser  intensity  series  from  the  Santa 
Fe  competition.2  Viewing  time  series  prediction  as  function  approximation  or  nonlinear  regression,  it 
would  make  it  easier  for  the  predictor  if  we  had  more  points  that  lie  on  the  manifold.  (Figure  3  shows  the 
manifold  the  network  is  trying  to  approximate.)  In  the  competition,  Sauer  (1994)  upsampled  the  1000 
available  data  points  with  an  FFT  method  by  a  factor  of  32.  This  does  not  change  the  effective  sampling 
rate,  but  it  “fills  in”  more  points.  We  use  the  same  upsampling  trick  (without  filtered  embedding),  and 
randomly  split  the  resulting  31200  data  points  into  two  equal-sized  sets,  A  and  B. 

Phase  I:  We  train  a  simple  network  with  25  inputs  (corresponding  to  25  past  values),  12  tanh  hidden 
units,  and  1  linear  output  for  single-step  prediction.3  (Learning  rate  rj  —  10"'.)  At  the  end  of  each 
epoch,  we  monitor  not  only  the  single-step  errors  for  the  cross-validation  set  R,  but  also  something  else: 
in  view  of  the  competition  task  of  iteratively  predicting  100  values  (with  error  bars),  we  pick  500  starting 
positions  at  random  from  the  original  series  (from  points  26-900).  We  then  compute  the  100-point  iterated 
predictions  for  each  of  these  starting  position.  Phase  I  stops  when  the  average  error  of  the  100  predictions 
following  the  500  starting  points  has  reached  a  minimum.  Interestingly,  this  minimum  occurs  well  before 
the  minimum  of  single-step  prediction  error  on  set  B. 

Phase  II:  We  freeze  the  weights  from  Phase  I,  and  add  the  12  tanh  hidden  units  as  shown  in  Figure  1. 
We  train  the  'u-unit  on  the  squared  errors  from  set  B  with  y  =  10" 9 .  The  final  weights  for  Phase  II  are 
taken  at  the  location  of  the  error  minimum  on  set  A. 

Phase  III:  All  weights  are  unfrozen,  and  the  composite  network  is  trained  with  rj  =  10”9.  We  monitor 
both  the  cost  C  and  the  average  error  over  the  500  iterated  sequences.  The  final  network  is  the  one  at 
the  minimum  of  the  iterated  prediction  measure. 

The  competition  task  for  this  data  set  was  to  submit  a  100-point  continuation  with  error  bars.  None  of 
the  14  submissions  estimated  the  error  bars  in  a  principled  way.  Our  method,  as  discussed  so  far,  does 
not  yet  provide  uncertainty  estimates  for  iterated  predictions  either.  We  now  show  how  our  single-step 
uncertainty  estimates  can  be  used  for  iterated  predictions:  In  addition  to  the  point  prediction  of  the  next 
time  step,  we  make  predictions  at  ±  one  standard  deviation  of  the  most  recent  value  of  the  input  vector. 
If  the  distance  between  either  of  these  two  predictions  is  greater  than  the  single-step  uncertainty  estimate, 
this  distance  is  used  as  the  current  uncertainty  estimate.  Because  the  competition  log  likelihood  measure 
is  highly  sensitive  to  even  a  single  error  bar  that  is  too  small  (Gershenfeld  &  Weigend,  1994,  pp.  64-65),  we 
impose  a  lower  bound  of  4.0  on  the  variance  in  calculating  the  iterated  uncertainties.  Table  2  summarizes 
the  results.  We  would  like  to  point  out  that  our  method  of  predicting  errors  could  have  been  followed  by 
any  competition  participant —  the  only  data  we  use  for  model  building  are  precisely  the  1000  points  that 
were  available  during  the  competition. 

2  The  data  set  and  several  predictions  and  characterizations  are  described  in  the  volume  edited  by  Weigend  Sz  Gershen¬ 
feld  (1994).  The  data  is  available  by  anonymous  ftp  at  ftp.es  .Colorado  .edu  in  /pub/Time-Series/SantaFe  as  A.dat. 

3 These  are  the  same  dimensions  as  Wan  (1994)  used;  we  did  not  try  any  other  architectures. 
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Train 

900-pt  single  st.  900-1000  iterated 
NMSE  NMSE 

Test 

1-100  single  st.  1-100  iterated 

NMSE  NMSE  NALL 

Naive  Baseline 

— 

— 

— 

1.001 

8.50 

Sauer (1994) 

— 

— 

— 

0.080 

4.84 

Wan  (1994) 

Network 

0.0004 

0.0026 

0.0230 

0.055 

— 

Competition  Entry 

— 

— 

— 

0.028 

3.48 

This  Paper 

Network 

0.0010 

0.0142 

0.0198 

0.096 

— 

Competition  Conditions 

— 

— 

— 

0.016 

3.28 

Table  2:  Results  for  the  laser  data,  both  for  single  step  (“single  st.”)  and  iterated  predictions.  The  “naive 
baseline”  corresponds  to  simply  using  the  mean  of  the  training  data  as  predictions  and  the  standard 
deviation  of  the  training  data  as  error  bars  for  every  point.  Wan’s  (1994)  competition  entry  consisted  of 
75  points  generated  by  his  network,  followed  by  25  points  (after  the  predicted  collapse)  from  a  similar 
stretch  that  he  picked  from  the  training  data  by  visual  inspection.  To  facilitate  the  comparison,  after  our 
network  indicates  a  collapse  (based  on  a  sudden  jump  in  uncertainty),  we  also  uses  the  mean  of  training 
points  620-700  as  the  prediction,  and  the  standard  deviation  of  this  segment  as  the  expected  error  after 
the  collapse.  Boldface  denotes  the  measures  used  in  the  Santa  Fe  competition. 


(a)  (b) 


Figure  3:  Results  for  the  laser  data:  (a)  Histogram  of  residual  errors.  The  slight  kurtosis  suggests  non- 
uniform  variance,  but  less  than  in  Figure  2b.  (b)  State-space  embedding:  the  grey  scale  indicates  the 
degree  of  certainty  (v(x)  from  lightest  to  darkest,  <  3,  3-5,  5-10,  >  10).  Different  regions  have  clearly 
different  degrees  of  predictability.  This  shows  that  it  is  a  good  idea  to  get  local  estimates  of  uncertainty! 


5  Discussion  and  Relation  to  Other  Work 

In  summary,  we  have  combined  Sauer’s  upsampling  trick  with  our  method  of  estimating  uncertainties, 
both  for  single-step  and  iterated  predictions.  The  key  feature  is  that  we  start  with  the  maximum  likelihood 
principle  and  arrive  at  local  error  bars  that  depend  on  the  location  in  input  space.  Comparing  our  result 
for  the  laser  benchmark  data  (shown  in  Figure  4)  with  the  top  two  Santa  Fe  competition  entries  (in 
Table  2)  demonstrates  both  tne  ease  with  which  a  simple  network  approximates  the  manifold  given 
sufficient  data,  and  the  applicability  of  our  method  for  predicting  impredictability. 

Our  method  encompasses  most  sources  of  uncertainty,  such  as  uncertainty  due  to  stochasticity  (outside 
shocks,  measurement  noise),  and  uncertainty  due  to  the  divergence  of  nearby  trajectories  (particularly 
large  after  the  collapses).  It  indirectly  also  handles  model  misspecification  by  appropriately  overestimating 
the  variance.  It  does  not  take  into  account  the  uncertainty  of  the  predictor  due  to  specific  splits  of  the 
data  into  training,  cross-validation,  and  test  sets.  Weigend  &  LeBaron  (1994)  use  a  bootstrapping  method 
and  show  on  an  example  (financial  data)  that  this  source  of  uncertainty  is  significantly  more  important 
than  model  and  evaluation  uncertainty  due  to  other  choices  (e.g.,  initial  weights).  However,  since  we  here 
wanted  to  followed  the  Santa  Fe  rules  as  closely  as  possible,  we  did  not  bootstrap  the  test  set. 

When  dealing  with  finite  sets  of  noisy  data,  overfitting  the  function  is  already  a  serious  problem.  If  we  also 
want  to  estimate  local  error  bars,  not  only  must  we  fear  overfitting  t/(x),  but  we  must  also  be  concerned 
with  overfitting  u(x).  If  the  standard  method  of  starting  with  small  weights  and  stopping  at  the  minimum 
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Figure  4:  Iterated  predictions  for  the  laser  data  on  the  continuation  set  (never  used  in  training).  Note  the 
large  errors  and  large  error  bars  in  the  region  of  great  uncertainty  around  Time=66.  In  the  region  after 
the  collapse,  we  follow  Wan  (1994)  and  use  as  prediction  values  the  mean  of  the  post-collapse  training 
data  (from  training  points  620-700),  and  as  (global)  error  bars  the  standard  deviation  of  the  same  training 
region. 


of  an  appropriate  cross-validation  set  does  not  suffice  for  a  given  problem,  it  is  straightforward  to  employ 
the  usual  anti-overfitting  weaponry  (smooth  v  as  a  function  of  x,  pruning,  weight-elimination,  etc.). 

In  any  example,  we  have  to  choose  a  specific  error  model.  The  framework  presented  here  encompasses 
any  distribution  with  a  location  parameter  (conditional  mean)  and  a  scale  parameter  (local  error  bar). 
The  framework  also  carries  over  from  regression  to  classification  where  it  allows  to  quantify  the  amount 
of  uncertainty  of  a  probability  estimate  (of  a  pattern  belonging  to  a  certain  class)  by  giving  the  width  of 
that  distribution,  again  depending  on  the  input  pattern. 
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